Abstract


Link streams model the dynamics of interactions in complex distributed
systems as sequences of links (interactions) occurring at a
given time. Detecting patterns in such sequences is crucial for many
applications, but it raises several challenges. In particular, there is
no generic approach for the specification and detection of link stream
patterns in a way similar to regular expressions and automata for text
patterns. To address this, we propose a novel automata framework
integrating both timed constraints and finite memory together with a
recognition algorithm. The algorithm uses structures similar to tokens
in high-level Petri nets and includes non-determinism and concurrency.
We illustrate the use of our framework in real-world cases and evaluate
its practical performance.


Keywords: Timed pattern recognition, difference bound matrices, 
finite-memory automata, timed automata, complex networks, link 
streams. 


1 Introduction 


Large-scale distributed systems involve a great number of remote entities 
(computer nodes, applications, users, etc.) interacting in real-time following 
complex network topologies and dynamics. One classical way to observe the
behavior of such a complex system is to take snapshots of the system at given
times and represent the global state as a very large and complex graph. The
behavior of the system is then observed as a timed sequence of graphs. The
algorithmic detection of patterns of behaviors in such large and dynamic
graph sequences is a very complex, most often intractable, problem. The link
stream formalism [14, 19] has been proposed to model complex interactions
in a simpler way. A link stream is a sequence of timestamped links (t, u, v),
meaning that an interaction (e.g. message exchange) occurred between u
and v at time t. The challenge is to develop analysis techniques that can be
performed on the link streams directly, without having to build the underlying
global graph sequences. The patterns of interest in link streams involve both
structural and temporal aspects, which raises serious challenges regarding
the description of such patterns and the design of detection algorithms.
The problem has been mostly approached from two different angles. First,
recognition algorithms have been developed for specific patterns such as
triangles in [15]. The focus is on performance concerns, involving non-trivial
algorithmic issues. At the other end of the spectrum, complex event
processing (CEP) has been proposed as a higher-level formalism to describe 
more complex interaction patterns in generic event streams [1, 23]. These 
generic works do not handle the specificity of the input streams. For example, 
the real-time and graph related properties are of particular interest in link 
streams. Our objective is to develop an intermediate approach, generic 
enough to cover a range of interesting structural and temporal properties, 
while taking into account the specificities of the abstraction under study, 
namely the link streams. 


Our starting point is the regular expressions and finite state automata 
for the recognition of patterns in texts. The main idea is to interpret 
link streams as (finite) words and develop a pattern language involving 
both structural and temporal features. This leads to a new kind of hybrid 
automata, the timed ν-automata, as recognizers for this pattern language.
They are built upon finite state automata (FSA) with both timed [2, 3] and
finite-memory [13, 8, 9] features. The patterns themselves can be specified
by enriched regular expressions, and "compiled" to timed ν-automata. The
problem of timed pattern matching has been addressed only quite recently
in e.g. [17, 18, 21, 22]. While our model bears some resemblance to these
propositions, we adopt a generalized approach to temporal patterns based
on the difference bound matrix (DBM) abstraction [10]. Moreover, we study
pattern matching in the presence of real-time constraints together with finite
memory. To our knowledge this has not been addressed in the literature.

Figure 1: A link stream (left) and its graph projection in time interval [8, 15]
(right).

One interesting aspect of the automata model we propose is that the 
recognition principles are based on a non-trivial token game. Indeed, our 
main inspiration comes from high-level Petri nets. Based on this formalism, 
we developed a prototype tool that we applied to real-world link streams 
analysis. Performance issues are raised but the results are encouraging. In 
particular, our experiments confirm the following key fact: timing properties 
often help in reducing the performance cost induced by storage of information 
in memory. 

The present paper is an extended version of [5] with a generalization of
the timed model using the DBM abstraction. The outline of the paper is
as follows. In Section 2 we introduce the principles of finding patterns in
link streams. The automata model and recognition principles are formalized
in Section 3. Section 4 is entirely new and describes a timed pattern
matcher based on difference bound matrices. The pattern languages, the
prototype tool we developed and a few experiments are then discussed in
Section 5.


2 Patterns in Link Streams 


We consider link streams [14] defined as sequences of triples $(t_i, u_i, v_i)$,
meaning that we observe a link between nodes $u_i$ and $v_i$ at time $t_i$. Figure 1
(left) shows an example of a link stream that models interactions between
nodes a, b, c and d. For example at time t = 6 a link from node d to node b
is observed, which corresponds to a triple (6,d,b) in the stream.
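
As a minimal sketch of this data model, a link stream can be held as a list of triples, and the graph projection over a time interval is simply the set of links observed in that interval. The stream contents below are hypothetical (only the triple (6, d, b) comes from the text), and the names `Link`, `stream` and `projection` are ours, not part of the formalism.

```python
from typing import List, Set, Tuple

Link = Tuple[float, str, str]  # (timestamp t, source node u, target node v)

# Hypothetical stream: only (6, "d", "b") is taken from the text.
stream: List[Link] = [
    (2, "a", "c"),
    (6, "d", "b"),
    (9, "a", "d"),
    (12, "a", "b"),
]


def projection(links: List[Link], start: float, end: float) -> Set[Tuple[str, str]]:
    """Directed edges observed in the closed time interval [start, end]."""
    return {(u, v) for (t, u, v) in links if start <= t <= end}
```

Calling `projection(stream, 8, 15)` keeps only the links whose timestamps fall in [8, 15]; no intermediate graph sequence is built.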

A pattern in such a link stream can be seen as a series of (directed)
subgraphs observed in a given time frame. For example, at time t = 15 we
observe the subgraph described on the right of Figure 1. This graph has 
been formed in the depicted time frame of 7s. One trivial way to detect 
such patterns is to build all the intermediate graphs and solve the subgraph 
isomorphism problem at each time step. This is however out of reach in most 
situations most notably because: (1) real-world link streams involve very 
large graphs, and (2) subgraph isomorphism is an NP-complete problem. 
Hence, in practice dedicated algorithms are developed for specific kinds of 
subgraphs. One emblematic example is the triangle for which specialized 
algorithms have been developed. A triangle is simply the establishment of a 
complete subgraph of three nodes, in a directed way. In network security 
this is a known trigger for attacks: two nodes that may be identified as
"attackers" negotiate to "attack" a third node identified as the "target".
Such a trigger can be observed in Figure 1 (right) with a and d attackers 
targeting b. In real-world link streams, detecting such triangles is in fact not 
trivial, as explained for example in [15]. 

In this paper, our motivation is to develop a more generic approach able 
to handle not only such triangles but also other kinds of patterns: directed 
polygons, paths, alternations (e.g. links that appear periodically), etc. We 
also require the matching algorithms to be of practical use, hence with 
efficiency in mind. Our starting point is the theory of finite-state automata 
(FSA) and regular expressions. Indeed, if we ignore the timestamps, a link 
stream is similar to a finite word, each symbol being a directed link (a pair 
of nodes). For example in the time frame [8, 15] we observe the following
"word":


(a, d)(d, b)(a, c)(a, b)(c, b). 


Based on such a view, we can use FSA as pattern recognizers and 
regular expressions as a high-level specification language. A regular pattern 
for the triangle example is as follows: 


\[ (((a \to d) \mid (d \to a)) \cdot ((a \to b) \otimes (d \to b))) \otimes (@ \to @)^{*} \]


This expression uses classical regular constructs such as concatenation $\cdot$,
disjunction $\mid$, the Kleene star $*$ and shuffle $\otimes$. The symbol @ is used as a
placeholder for any possible node, hence $(@ \to @)$ means "any possible link".
Based on such a specification, it is easy to build a finite-state automaton to
recognize the triangles in an untimed link stream very efficiently.
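
To make the untimed reading concrete, here is a small hand-written recognizer for this single pattern. It is only a sketch under our reading of the expression (a link between a and d, later followed by both a → b and d → b in any order, interleaved with arbitrary other links); it is not the general automaton construction, and the function name `has_triangle` is hypothetical.

```python
from typing import Iterable, Tuple


def has_triangle(word: Iterable[Tuple[str, str]], a: str = "a", d: str = "d", b: str = "b") -> bool:
    """Check an untimed word of links for the triangle pattern on (a, d, b)."""
    negotiated = False              # has a link between a and d been seen yet?
    pending = {(a, b), (d, b)}      # attack links still to be observed
    for link in word:
        if not negotiated:
            negotiated = link in {(a, d), (d, a)}
        else:
            pending.discard(link)   # unrelated links are simply skipped
            if not pending:
                return True
    return False


# The "word" observed in the time frame discussed above.
word = [("a", "d"), ("d", "b"), ("a", "c"), ("a", "b"), ("c", "b")]
```

Taking the earliest negotiation link is sufficient here, because any later attack links remain available after it.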

However, the "regular language" approach fails to capture the timing
properties of link streams. What we need is a form of real-time pattern
matching. Quite surprisingly, there are very few research works addressing
this problem, despite the broad success of timed automata [2] in general.
An important starting point is the timed regular expressions formalism [3].
The basic principle is to interpret input words, hence link streams, as timed
event sequences: a succession of either symbols or delays corresponding to
a passage of time. Below is an example of a link stream as a timed event
sequence:
(a, b)2(d, b)2(a, c)1(a, d)1(c, b). 
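
The conversion from timestamped links to such a timed event sequence can be sketched as follows. The timestamps used in the example are hypothetical (chosen so that the delays match the sequence above), the helper name `to_timed_events` is ours, and we assume that the offset before the first link is ignored.

```python
from typing import List, Tuple, Union

Event = Union[float, Tuple[str, str]]  # either a delay or a link (u, v)


def to_timed_events(links: List[Tuple[float, str, str]]) -> List[Event]:
    """Interleave links with the delays separating consecutive timestamps."""
    events: List[Event] = []
    previous_time = None
    for t, u, v in sorted(links, key=lambda link: link[0]):
        if previous_time is not None and t > previous_time:
            events.append(t - previous_time)  # passage of time between two links
        events.append((u, v))
        previous_time = t
    return events


# Hypothetical timestamps reproducing the delays 2, 2, 1, 1.
timed = to_timed_events([(0, "a", "b"), (2, "d", "b"), (4, "a", "c"), (5, "a", "d"), (6, "c", "b")])
```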


A timed regular expression for the triangle pattern can then be specified,
e.g.:


\[ (((a \to d) \mid (d \to a)) \cdot ((a \to b) \otimes (d \to b))_{[0,1]}) \otimes (@ \to @)^{*} \]


The delay construction $(S)_{[x,y]}$ says that the subpattern S must be detected
in time interval [x,y]. For the triangle pattern it means that the nodes a
and d are only observed as "attacking" target b if they simultaneously link
to b in the time interval of one second.

Another fundamental aspect that we intend to capture in link stream 
patterns is that of incomplete knowledge. In classical and timed automata, 
symbols range over a fixed and finite alphabet. In link streams, this means 
that the nodes of the graphs must be known in advance, which is in general 
too strong an assumption. In an attack scenario, for example, we must 
consider an open system: it is very likely that only the target is known in 
advance, and the two attackers remain undisclosed. 

The kind of pattern we intend to support is e.g.:


\[ ((\sharp X \to \sharp Y) \cdot ((X \to b) \otimes (Y \to b))_{[0,1]}) \otimes (@ \to @)^{*} \]


In this pattern, the variables X and Y represent unknown nodes corresponding
to two "attackers". The construction $\sharp X$ means that the input symbol
(hence node) associated to X must be fresh, i.e., not previously encountered.
In case of a match this node is associated to X and kept in memory. With
the operator $X!$ (the dual of $\sharp X$), after matching a value associated to
variable X, all the values associated to it are discarded (i.e., the associated
set is cleared).

The sub-pattern $(\sharp X \to \sharp Y)$ describes a link between two fresh nodes.
Note that since Y is matched after X, the freshness constraints impose that
its associated node is distinct from the one of X. To match the sub-expression
$(X \to b)$, the input must be a link from the node already associated with X
in memory to node b. This is a potential attack on the target b.

Figure 2: Automaton for $(@ \to @)^{*} \otimes ((a \to b) \vee a \to \sharp X \cdot (X! \to \sharp X)^{*} \cdot (X! \to b))_{[0,1]}$

To handle such dynamic matching, one must consider a (countably) 
infinite alphabet of unknown symbols. This has been studied in the context 
of quasi-regular languages and finite memory automata (FMA) [13]. In 
this paper, we build upon the model of ν-automata that we developed in a
previous work [8, 9]. It is a variant of FMA, which is tailor-made for the
problem at hand. Compared to the classical FMA model, the ν-automata
can be seen as a generalization to handle freshness conditions [16].

The resulting mixed model of timed ν-automata is quite capable in
terms of expressiveness. The automaton formalism is a combination of both
the timed constraints and clock resets from timed automata and the memory
management of the ν-automaton. As an illustration, Figure 2 depicts an
automaton that detects in a link stream all the paths from a node a to a
node b such that each link is established in at most one second. We suppose
that the automaton is defined for the alphabet $\Sigma = \{a, b\}$, i.e., only the
nodes a and b are initially known. The labels $\nu X, X$ and $X, \bar{\nu} X$ are the
automata variants of the operators $\sharp X$ and $X!$ discussed previously. An
example of an accepting input is:


(a,y) 0.1 (y, z) 0.3 (y,b).

Initially, in state $q_0$ the known symbol a is consumed while transiting
to state $q_1$. The unknown symbol y is saved in the memory associated
to variable X while transiting to state $q_2$. This only works because the
symbol y is fresh, i.e., not previously encountered. The delay of 0.1 second
is consumed in state $q_2$ while increasing the value of the clock c to 0.1. The
state constraint c < 1 is still satisfied. The next input y may either lead to $q_3$
(because it was previously associated to X) or $q_7$ (because the symbol @
accepts any input). The recognition principle is non-deterministic so both
possibilities will be tried:


• if the transition $q_2 \xrightarrow{X,\, \bar{\nu} X} q_3$ is taken, X is no longer associated to any
  symbol in $q_3$. The next input is the unknown symbol z. From $q_3$, only
  the transition $q_3 \xrightarrow[{c \in [0,1]}]{\nu X,\, X} q_2$ is enabled. In $q_2$ the variable X would be
  associated to z. However, this path is doomed because the next (and
  last) link does not start from z. Then at the end of the input sequence
  the path leads to state $q_2$ which is not a terminal state.


• if the transition $q_2 \xrightarrow{@} q_7$ is taken then the value associated to X is not
  discarded and the input z leads back to the state $q_2$ through transition
  $q_7 \xrightarrow{@} q_2$. The input 0.3 increases the clock value to c = 0.4. The next
  input y may again lead either to $q_7$ or $q_3$ as in the previous case. In
  state $q_3$ the input b enables only the transition $q_3 \xrightarrow[{c \in [0,1]}]{b} q_4$, which
  leads to the final state $q_4$ (since $b \in \Sigma$).


We reach an accepting state because the clock value c = 0.4 is still
under 1 second. On the other hand, if the second delay is not 0.3 but e.g., 1.0
then the link stream is not recognized because of a timeout in state $q_2$.


3 Automata Model and Recognition Principles 


The automata model we propose can be seen as a layered architecture with: 
(1) a classical (non-deterministic) finite-state automata layer, (2) a timed 
layer (based on [3]) and (3) a memory layer (based on [9]). These layers are 
obviously dependent but there is a rather clean interface between them.


3.1 The Timed ν-Automata


Definition 1 A timed ν-automaton is a tuple:


\[ A = (\underbrace{\Sigma, Q, q_0, F, \Delta}_{\text{finite-state}},\ \underbrace{C, \Gamma}_{\text{timed}},\ \underbrace{\mathcal{U}, V}_{\text{memory}}) \]

The basic structure is that of a finite-state automaton. We first assume
a finite alphabet of known symbols denoted by $\Sigma$. The finite set Q is that of
locations³. The initial location is $q_0$ and F is the set of final locations. The
component $\Delta$ is the set of transitions (explained in detail below).

This basic structure is extended for the timed constraints with a set C of
clocks (ranging over $c_0, c_1, \ldots$) and a map $\Gamma$ that associates to each location
a timed constraint. A transition can also be annotated with time constraints
to restrict its firing. The grammar of timed constraints, identical to [2], is
as follows:


Definition 2 (Time constraints grammar)
\[ \gamma ::= \gamma \wedge \gamma \mid c_1 \sim n \mid c_1 - c_2 \sim n \]
where $c_1$ and $c_2$ are clocks, n is a constant in $\mathbb{Q}$ and $\sim\ \in \{=, <, >, \leq, \geq\}$.
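
As an illustration of this grammar, a constraint can be represented as a conjunction (here a tuple) of atoms and evaluated against a concrete clock valuation. This is only a sketch: the names `Atom`, `OPS` and `holds` are hypothetical, and point valuations are used instead of the timezones introduced later.

```python
import operator
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict, Optional, Tuple

# Comparison operators allowed by the grammar.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}


@dataclass(frozen=True)
class Atom:
    """Atomic constraint: `clock ~ bound`, or `clock - minus ~ bound` if minus is set."""
    clock: str
    op: str
    bound: Fraction
    minus: Optional[str] = None


def holds(constraint: Tuple[Atom, ...], valuation: Dict[str, Fraction]) -> bool:
    """A conjunction holds when every atom holds for the given clock values."""
    for atom in constraint:
        value = valuation[atom.clock]
        if atom.minus is not None:
            value = value - valuation[atom.minus]
        if not OPS[atom.op](value, atom.bound):
            return False
    return True


# The guard "c in [0, 1]" written as a conjunction of two atoms.
guard = (Atom("c", ">=", Fraction(0)), Atom("c", "<=", Fraction(1)))
```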


The memory component is a finite set V of variables (ranging over
X, Y, ...) for the memory constraints. Each variable will be associated to a
(possibly empty) set of unknown symbols ranging over a countably infinite
alphabet denoted by $\mathcal{U}$. These symbols are all the symbols that may appear
in an input sequence, which are not in $\Sigma$. Unlike FMA, which are limited
by the number of their registers, the ν-automata use variables of dynamic
size, which allows recognizing words composed of an arbitrary number of
distinct unknown symbols.


Definition 3 A transition $t \in \Delta$ of a timed ν-automaton is of the form:


\[ q \xrightarrow[\gamma,\, \rho]{\nu,\, e,\, \bar{\nu}} q' \]

with q (resp. q') the starting (resp. ending) location, $\nu \subseteq V$ a set of variable
allocations, $\bar{\nu} \subseteq V$ a set of variable releases. The event e is either a symbol
in the finite alphabet $\Sigma$, a use of a variable in V or an $\varepsilon$. The transition
timed constraint is $\gamma$. Finally, $\rho$ is the set of clocks to be reset to 0 while
crossing the transition. To simplify the notation of transitions, the empty
sets are omitted.

³ The notion of a location here corresponds to a state in classical automata theory. We
rather use the term state in the sense of actual state or configuration (as in FMAs [13]),
i.e., an element of the state-space: a location together with a memory content and clock
values.


3.2 The State Notion: Tokens


States in timed ν-automata are formed by a distribution of tokens⁴, i.e.,
combinations of memory and timed valuations, over given locations. Because
the recognition principles we develop exploit the intrinsic non-determinism of
ν-automata, each location can be associated to multiple tokens, each token
corresponding to a particular reachable state⁵.


Definition 4 (Token) A token is a pair k = (D,M) with D a time zone 
representing a set of possible clock values, and M a memory valuation being 
a mapping from variables to sets of allocations. 


The timed valuation of a token is represented by a timezone D that
encodes a set of clocks $c_1, \ldots, c_n$ (together with a special clock $c_0$ representing
the time 0) associated to the constraints about their possible values.
Following [10] we technically represent timezones as difference bound matrices
(DBMs). Because it is a rather complex aspect, in this section we only
discuss the high-level point of view of timezones; the DBM representation is
detailed in Section 4.

The memory valuation M of a token is represented as a set M of 
variable allocations. 


Definition 5 (Variable allocation) For a variable $X \in V$, an allocation
is a finite subset of unknown symbols $A_X \subseteq \mathcal{U}$, together with a flag. The flag
may be $A_X^{r}$ (read mode, default) or $A_X^{w}$ (write mode). In read mode, the only
available operation is to check if an input symbol is already present in $A_X$.
In write mode, the only available operation is to add to $A_X$ a fresh symbol
$\alpha \notin \bigcup_{Y \in V} A_Y$.


⁴ The notion of token we use is very similar to, and in fact inspired by, the corresponding
notion of high-level Petri nets.

⁵ In fact each state is itself a (potentially infinite) set of possible clock values corresponding
to the token's timezone.

Property 1 (Memory injectivity) For any pair of distinct variables X, Y
we have $A_X \cap A_Y = \emptyset$.
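
A possible encoding of allocations, together with a check of the injectivity property, is sketched below. The class `Allocation` and the function `is_injective` are hypothetical names introduced for illustration; the flag is modelled as a boolean, with read mode as the default.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Allocation:
    """Set of unknown symbols bound to a variable, plus its read/write flag."""
    symbols: FrozenSet[str] = frozenset()
    write: bool = False  # False stands for read mode (the default)


def is_injective(memory: Dict[str, Allocation]) -> bool:
    """Property 1: the sets of any two distinct variables are disjoint."""
    return all(
        memory[x].symbols.isdisjoint(memory[y].symbols)
        for x, y in combinations(memory, 2)
    )
```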


Although most memory models do not work like this, this injectivity
property is an important feature of finite-memory automata models because
it allows a compact representation of memory constraints (cf. [13]). This
property, from ν-automata, strengthens the memory constraint without reducing
the expressiveness of the model. It allows a less ambiguous description
of patterns.

A configuration of a hybrid ν-automaton is a kind of global state
that encompasses a set of proper reachable states, thus expressing some
non-determinism. Technically, the definition is as follows.


Definition 6 (Configuration) A configuration of an automaton is a mapping
S from the locations in Q to sets of tokens. We denote by S(q) the set
of tokens associated to location q.


The initial configuration of a timed ν-automaton contains a single token
in the initial location. The content of this token is $(D_0, \{X \mapsto \emptyset^{r} \mid \forall X \in V\})$
where $D_0$ is the initial timezone with all the clocks initialized at time 0. The
memory is empty, i.e., each variable is associated to an empty set with the
read mode flag.

Each time an input is read a new configuration is computed from the 
previous one. The whole input sequence is accepted if after being consumed 
entirely there is at least one token in some final location of the automaton. 
This token game is explained in the next section. 
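
The initial configuration and the acceptance condition can be sketched as follows. This is a simplification under stated assumptions: the timezone is replaced by a single point valuation (every clock at 0), tokens are encoded as hashable tuples, and the function names are ours.

```python
from typing import Dict, Iterable, Set, Tuple

Token = Tuple[tuple, tuple]  # (clock valuation, memory valuation), both hashable


def initial_configuration(locations: Iterable[str], initial: str,
                          clocks: Iterable[str], variables: Iterable[str]) -> Dict[str, Set[Token]]:
    """Single token in the initial location: clocks at 0, empty sets in read mode."""
    token = (
        tuple((c, 0) for c in sorted(clocks)),
        tuple((x, (frozenset(), "r")) for x in sorted(variables)),
    )
    return {q: ({token} if q == initial else set()) for q in locations}


def accepts(configuration: Dict[str, Set[Token]], final_locations: Iterable[str]) -> bool:
    """Accept when at least one token sits in some final location."""
    return any(configuration[q] for q in final_locations)
```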


3.3 Token Game


Given a global configuration S and an input $\alpha$ (either a time delay or an
event, i.e., a known or an unknown symbol), the objective is to build a next
configuration S' corresponding to all the reachable states of the automaton
after consuming the input. The formal definition is as follows.


Definition 7 (Global update)


\[ \sigma(S, \alpha) = \begin{cases} \sigma_{closure}(S, \alpha) & \text{if } \alpha \in \mathbb{Q}^{+} \text{ (time delay)} \\ \sigma_{closure}(\sigma_{step}(S, \alpha), 0) & \text{otherwise (symbol)} \end{cases} \]

In the case of a time delay the tokens should be propagated through
the ε-transitions, which is handled by the $\sigma_{closure}$ function presented
below. In the case of an event, the next tokens will be produced by the
transitions that are enabled for the input symbol. This is formalized by
the function $\sigma_{step}$ defined below. We also use the closure function to handle
the ε-transitions. The "trick" is to consider that the event is recognized
together with a time delay of 0. By the non-deterministic nature of the
automata model, if a token enables multiple transitions then a new token
will be generated in each location reachable by those transitions.
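
The dispatch performed by the global update can be sketched as a small higher-order function. The two callbacks `step` and `closure` are placeholders for the event handling and ε-closure functions described in this section; their signatures, and the name `global_update`, are assumptions made for this illustration.

```python
from numbers import Real
from typing import Any, Callable


def global_update(configuration: Any, inp: Any,
                  step: Callable[[Any, Any], Any],
                  closure: Callable[[Any, Any], Any]) -> Any:
    """Apply one input: a delay goes to the closure, an event to step then closure."""
    if isinstance(inp, Real):
        # Time delay: propagate tokens along epsilon-paths while consuming it.
        return closure(configuration, inp)
    # Event: fire the enabled transitions, then close with a delay of 0.
    return closure(step(configuration, inp), 0)
```

Treating an event as "step followed by a zero-delay closure" keeps a single code path for ε-transitions.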


3.3.1 Event Handling 


We first consider the case of events. The time delays, a little bit more 
involved, will follow. 


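Definition 8 (Event handling)
\[ \sigma_{step}(S, \alpha) = \{ q' \mapsto \{ \delta_{step}(t, k, \alpha) \mid t = q \xrightarrow[\gamma,\, \rho]{\nu,\, e,\, \bar{\nu}} q' \in \Delta \wedge k \in S(q) \} \mid q' \in Q \} \]
with $\delta_{step}(t, (D, M), \alpha) = (\delta_{time}(t, D), \delta_{mem}(t, M, \alpha))$ when defined.


The $\sigma_{step}$ function simply consists in applying the local update function
$\delta_{step}$ at all locations for all non ε-transitions. This function is partial,
only defined if the subfunction $\delta_{time}$ returns a non-empty timezone, and $\delta_{mem}$
yields a value distinct from $\bot$.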

The function $\delta_{time}$ computes the new timezone after crossing the considered
transition. It is defined as follows.


Definition 9 (Time constraint)

\[ \delta_{time}(q \xrightarrow[\gamma,\, \rho]{\nu,\, e,\, \bar{\nu}} q', D) = (tz(\gamma) \cap D)[\rho \leftarrow 0] \]

The notation $tz(\gamma)$ denotes the conversion of the time constraint $\gamma$ to a
corresponding timezone. We then compute the intersection of the latter with
the timezone D. In the final timezone, the clocks identified by $\rho$ are reset.
These operations are formalized precisely in Section 4. 
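
For intuition only, the same computation can be sketched for a single clock whose timezone is a closed interval. This is a deliberate simplification and not the DBM representation: the type `Interval` and the function `delta_time` below are hypothetical.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # closed interval [low, high] of values for one clock


def delta_time(zone: Interval, guard: Interval, reset: bool) -> Optional[Interval]:
    """Intersect the zone with the guard, then reset the clock if requested."""
    low = max(zone[0], guard[0])
    high = min(zone[1], guard[1])
    if low > high:
        return None  # empty timezone: the transition cannot be crossed
    return (0, 0) if reset else (low, high)
```

For instance, a zone [2, 4] crossed with a guard [0, 3] leaves [2, 3], whereas a guard [0, 1] yields an empty zone and therefore no token.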

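The memory part of the next token is computed by the memory update
function $\delta_{mem}$ from the previous memory component depending on an input
symbol $\alpha$. The computation respects the following ordering: (1) the allocation
of the variables in set $\nu$ is performed, then (2) the consistency between
the input and transition label is checked, and finally (3) the variables in the
set $\bar{\nu}$ are released.

Definition 10 (Memory update) Let V be a set of variables, and $\mathcal{U}$ an
infinite set of unknown symbols.


\[
\delta_{mem}(t, M, \alpha) = \left\{
\begin{array}{lll}
\bot & \text{if } e \in \Sigma \wedge \alpha \neq e & (c.1.1) \\
 & \vee\ e \in V \wedge \alpha \notin \mathcal{U} & (c.1.2) \\
 & \vee\ e \in V \setminus \nu \wedge M(e) = A^{r} \wedge \alpha \notin A & (c.1.3) \\
 & \vee\ (e \in \nu \vee M(e) = A^{w}) \wedge \exists Y,\ \alpha \in M(Y) & (c.1.4) \\
\{X \mapsto k_X \mid X \in V\} & \text{otherwise} &
\end{array}
\right.
\]

\[
k_X = \left\{
\begin{array}{lll}
\emptyset^{r} & \text{if } X \in \bar{\nu} & (c.2.1) \\
(A_X \cup \{\alpha\})^{r} & \text{if } X = e & (c.2.2) \\
A_X^{w} & \text{if } X \in \nu & (c.2.3) \\
M(X) & \text{otherwise} & (c.2.4)
\end{array}
\right.
\]


where $e \in V \cup \Sigma \cup \{\varepsilon\}$ denotes the input enabling transition t and the sets $\nu$
and $\bar{\nu}$ denote respectively the sets of allocated and freed variables.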


In the first four cases no token can be produced. If the transition label e
is a known symbol in $\Sigma$, then the input $\alpha$ must exactly match otherwise it is
a failure (c.1.1). If otherwise e corresponds to a variable, then $\alpha$ must be an
unknown symbol in $\mathcal{U}$ (c.1.2). A more subtle failure is (c.1.3) for a variable
$e \in V$ in read mode. In this situation the input symbol must be already
recorded in the memory associated to e. Moreover, if the variable e is in
write mode (or is put in write mode along the transition), then the input
symbol must be fresh (c.1.4).

If the next token is produced then for each variable X the associated
memory content $A_X$ is updated as follows. If X is to be released (in set $\bar{\nu}$)
then the memory is cleared and put in read mode (c.2.1). If it is not released
and the variable is to be read (i.e., X = e) then $\alpha$ is added to the memory
content (c.2.2). In (c.2.3) the variable is not read ($X \neq e$) but it is allocated
(in set $\nu$). In this situation the memory content is put in write mode.
Otherwise (c.2.4) the memory is left unchanged for variable X.
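
The case analysis above can be sketched in Python as follows. This is an illustrative reading, not a reference implementation: the encoding of a memory as a dictionary from variables to (set, mode) pairs, the use of `None` for failure, the parameter names, and the assumption that the flag is back in read mode after case (c.2.2) are all ours.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

Alloc = Tuple[FrozenSet[str], str]  # (symbols, mode) with mode "r" or "w"
Memory = Dict[str, Alloc]           # one entry per variable in V


def delta_mem(alloc: Set[str], event: str, release: Set[str],
              memory: Memory, symbol: str, known: Set[str]) -> Optional[Memory]:
    """Memory update for a non-epsilon transition; None stands for failure."""
    if event in memory:                       # the label is a variable
        symbols, mode = memory[event]
        if symbol in known:                   # (c.1.2) variables match unknown symbols only
            return None
        if event in alloc or mode == "w":     # (c.1.4) write: the symbol must be fresh
            if any(symbol in s for s, _ in memory.values()):
                return None
        elif symbol not in symbols:           # (c.1.3) read: the symbol must be recorded
            return None
    elif event != symbol:                     # (c.1.1) a known symbol must match exactly
        return None
    updated: Memory = {}
    for var, (symbols, mode) in memory.items():
        if var in release:                    # (c.2.1) cleared and put in read mode
            updated[var] = (frozenset(), "r")
        elif var == event:                    # (c.2.2) the input is added (assumed read mode)
            updated[var] = (symbols | {symbol}, "r")
        elif var in alloc:                    # (c.2.3) allocated but not used: write mode
            updated[var] = (symbols, "w")
        else:                                 # (c.2.4) unchanged
            updated[var] = (symbols, mode)
    return updated
```

With X bound to {v} in read mode, a transition that allocates and uses X on the fresh input w yields {v, w}; using the already recorded input v on the same transition fails, because v is not fresh.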


Figure 3: Passing a transition of the automaton from Figure 2 with input w


Example 1 Figure 3 illustrates the generation of a new token taking as an
example the transition $d = q_1 \xrightarrow{\nu X,\, X} q_2$ in the automaton from Figure 2.
We are focusing here only on its memory component to illustrate $\delta_{mem}$. Based
on the initial memory valuation $\{X \mapsto \{v\}^{r}\}$ in location $q_1$, the input w
enables the transition d producing a new token in $q_2$, computed as follows:
The transition d is only enabled when the input is an unknown symbol,
because transition d is labeled with a variable. Since the alphabet $\Sigma$ of
known symbols is {a,b}, the symbol w is considered as unknown, i.e., $w \in \mathcal{U}$.
Because the allocations are applied before checking the input, the variable X
is allocated and then used to enable the transition. So the symbol w should
be added to $A_X$ in the newly generated token. However, it is only possible if
the input is fresh. Since $A_X = \{v\}$ and X is the only variable, this freshness
constraint is satisfied. Hence, the new token associates the memory $\{v, w\}^{r}$
to X. ◇


Figure 4: Passing a transition of the automaton from Figure 2 with input v


Example 2 Figure 4 presents another example of memory transition with
another transition in the same automaton from Figure 2. Here again we
ignore the temporal component of the automaton to focus on its memory
component. This example illustrates a case of memory evolution with $\delta_{mem}$.
Here the variable X is used as the trigger and then freed. The variable's
freeing occurs at the same time as the reset of the clocks, after the guards are checked.
As X is not allocated during the transition, nor was it allocated before,
the transition is enabled only if the input is an unknown symbol and belongs
to $A_X$. The input is actually the unknown symbol $v \notin \Sigma = \{a, b\}$. Furthermore,
$v \in \{v, w\} = A_X$, so the transition may be passed and the variable X
is cleared in the newly generated token. ◇


The important property of memory injectivity must be preserved
through $\delta_{mem}$ to fulfill the freshness constraints.


Proposition 1 (Preservation of injectivity) Let k be a token satisfying
Property 1, and suppose $k' = \delta_{step}(t, k, \alpha) \neq \bot$ for some transition t and
input $\alpha$. Then the token k' satisfies Property 1.

Proof: In the token k = (D, M) only the memory component M is
impacted by the injectivity property. The main hypothesis is that k satisfies
Property 1, which means that $\forall X, Y\ (X \neq Y),\ M(X) \cap M(Y) = \emptyset$.

Suppose that the transition $t = q \xrightarrow[\gamma,\, \rho]{\nu,\, e,\, \bar{\nu}} q'$ produces token
$k' = \delta_{step}(t, k, \alpha) = (D', M')$. We have to show that k' satisfies Property 1. In
the definition of $\delta_{mem}$ (Definition 10) we are concerned with cases (c.2.1) to
(c.2.4) because we expect a token as output. The memory update depends
on the value of the transition trigger e and is as follows:


• If e is not a variable, i.e., $e \in \{\varepsilon\} \cup \Sigma$, then the case c.2.2 of $\delta_{mem}$ cannot
  occur. So, in the token k' the variable domains are either empty (case
  c.2.1), or the same as in k (case c.2.3 or c.2.4). Given the hypothesis
  that k satisfies Property 1 and the fact that $\emptyset$ is the zero element of
  intersection, trivially k' satisfies Property 1 as expected.


• If e is a variable, i.e., $e \in V$, then the case c.2.2 occurs for exactly one
  variable of the generated token. As presented above, the variable
  domains generated with the cases c.2.1, c.2.3 and c.2.4 have empty
  intersections with each other. Only the domains generated by case
  c.2.2 must be handled with care. We have to consider two situations:

  - if $\alpha \in M(e)$ then the set $A_e$ is not modified, so Property 1 is
    trivially satisfied;
  - or $\alpha \notin M(e)$, then case c.1.4 ensures that $\alpha$ is absent from all the
    domains of the other variables. Thus, Property 1 is satisfied as
    well.


3.3.2 Time Delay and ε-Closure


We now explain the propagation of tokens for ε-transitions and a given time
delay x (a non-negative real value, potentially 0 in the case of an event), which
is handled by the $\sigma_{closure}$ function defined below. Note that it is a closure
function, in that whole paths of successive ε-transitions must be considered.
The rather non-trivial definition is as follows.


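Definition 11 (ε-closure)


\[ \sigma_{closure}(S, x) = \{ q \mapsto K \mid q \in Q \} \]

such that

\[
K = \left\{ (D', M') \;\middle|\;
\begin{array}{l}
\exists\ q_0 \xrightarrow[\gamma_1,\, \rho_1]{\nu_1,\, \varepsilon,\, \bar{\nu}_1} q_1 \xrightarrow[\gamma_2,\, \rho_2]{\nu_2,\, \varepsilon,\, \bar{\nu}_2} \cdots \xrightarrow[\gamma_n,\, \rho_n]{\nu_n,\, \varepsilon,\, \bar{\nu}_n} q \in \Delta \text{ (transitions named } t_1, t_2, \ldots, t_n\text{)}, \\
\exists k \in S(q_0),\ walk(k, [t_1, t_2, \ldots, t_n], x) = (D', M') \\
\wedge\ nempty(D') \wedge M' \neq \bot
\end{array}
\right\}
\]

Pattern Matching in Link Streams: Timed-Automata with Finite Memory 175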


For each ε-path from a location $q_0$ to a location q, and for each token
present in $q_0$ in the previous configuration S, we try to generate a new
token using the walk function. This new token consists of a timezone D’ 
and an updated memory valuation M’. An important requirement is that 
the delay x has been fully consumed at the end of the path. The function 
is partial, in particular it fails if x is consumed “too early” along the path. 
Because it involves rather complex DBM computations, the formal definition 
of the function walk is given in Section 4. 


Example 3 Figure 5 illustrates the dynamics of tokens in an €-closure 
expressed by the function Ociosure With the delay a = 4. 

The tokens in this automaton are composed of a variable X and a 
clock c. The initial configuration, in step 0, contains only one token ko 
in qo. This token is initialized with X in read mode and a set containing 
the unknown symbols u and v. There is only one clock c initialized to 0. In 


step 1, the token $k_0$ is propagated through the transition $t_{01} = q_0 \xrightarrow{\varepsilon,\ c \in [0,1]} q_1$,
which generates the token $k_1$ in location $q_1$. Since $t_{01}$ has no side-effect
(clock or memory update), $k_1$ is a copy of $k_0$. In step 2, the token $k_1$ is
propagated through the transition $t_{11} = q_1 \xrightarrow{\varepsilon,\ c \in [2,4]} q_1$, generating the token $k_1'$
in location $q_1$. The transition has no side-effect, so the memory of $k_1'$ is the
same as the memory of $k_1$. However, to fulfill the time constraint $c \in [2,4]$,
the value of $c$ has to be at least 2: to cross the transition, the clock
has to consume some amount of the input delay. In step 3, both $k_1$
and $k_1'$ can be propagated through the $\varepsilon$-transition $t_{12}$ from $q_1$ to $q_2$. This transition has as a
side-effect to clear the variable $X$. So both tokens $k_2$ and $k_2'$, generated
respectively from $k_1$ and $k_1'$, have for variable $X$ the value $\{\}^r$ (an empty set
of symbols in read mode). Step 4 consists in the propagation of tokens $k_2$
and $k_2'$ through the $\varepsilon$-transition $t_{20}$ from $q_2$ to $q_0$. This transition has as a side
effect to allocate $X$. However, as $t_{20}$ is an $\varepsilon$-transition, the set associated
to $X$ will not be modified and $X$ will be in write mode on the generated
tokens. In step 5 two tokens are generated in location $q_1$, but both come
from the same token of $q_0$, the one with $c \mapsto 0$. The other token generated
in step 4 has $c \mapsto 2$, so it cannot enable $t_{01}$ because the clock
constraint $c \in [0,1]$ is not respected. The token with $c \mapsto 0$ crosses $t_{01}$ and
the transition $t_{11}$ (as in step 2), generating two tokens in $q_1$ with
different clock values. After step 5 it is not possible to generate any new
token in a location with a different value than the tokens already present in
it. In step 6 the propagation is over and all the clocks are increased to 4 to
consume all the input delay.

[Figure 5: table listing, for each propagation step from 0 to 6, the tokens present in locations $q_0$, $q_1$ and $q_2$ together with their valuations of $X$ and $c$.]

Figure 5: Example of $\varepsilon$-closure with an input delay 4

Note that only one token is kept at a location if several are generated
with identical clock and memory valuations. Step 6 corresponds to the
configuration returned by $\delta_{closure}$.

Since there may be an infinite number of $\varepsilon$-paths from a given starting
location $q$, the following is an important property with respect to decidability.

Proposition 2 For a given configuration $S$ and time delay $x$, the function
$\delta_{closure}$ can only produce a finite number of tokens.


Proof: To prove the proposition, we show that both the possible memory
states and the possible clock states are finite over the propagation through the $\varepsilon$-closure.

First, we prove that the number of memory states is finite. An $\varepsilon$-transition
does not read any symbol. So, the only memory operations
present in an $\varepsilon$-closure are the allocation $\nu$ and the freeing $\bar{\nu}$. Let $X$ be a
variable of initial valuation $A_X^{\alpha}$, where $A_X$ is the set associated to $X$ and $\alpha$
the initial mode of $X$. Its reachable values in the $\varepsilon$-closure are:

• $A_X^{\alpha}$ in all $\varepsilon$-paths with no operations on $X$,

• $A_X^{w}$ in all $\varepsilon$-paths where $X$ is only allocated,

• $\emptyset^{r}$ in all $\varepsilon$-paths where the last memory operation used on $X$ is a
freeing $\bar{\nu}$,

• $\emptyset^{w}$ in all $\varepsilon$-paths where $X$ was freed at least once and the last memory
operation on $X$ is an allocation $\nu$.

As a consequence, if the tokens are composed of $n$ variables, after the
propagation in an $\varepsilon$-closure at most $4^n$ variations of each initial memory
valuation can be generated.
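
The bound of four values per variable can be checked by brute force. The following is a minimal sketch, not the authors' implementation: the pair `(contents, mode)` and the operation names `alloc` and `free` are hypothetical encodings of a variable valuation and of the operations $\nu$ and $\bar{\nu}$, under the assumption (taken from the example above) that an allocation on an $\varepsilon$-transition leaves the set unchanged and switches to write mode, while a freeing empties the set and switches to read mode.

```python
from itertools import product

ALLOC, FREE = "alloc", "free"  # hypothetical names for the operations nu and nu-bar


def apply_op(state, op):
    """Apply one memory operation of an epsilon-transition to (contents, mode)."""
    contents, mode = state
    if op == ALLOC:
        # No symbol is read on an epsilon-transition: the set is unchanged,
        # only the mode switches to write.
        return (contents, "w")
    if op == FREE:
        # Freeing empties the set and puts the variable back in read mode.
        return (frozenset(), "r")
    raise ValueError(f"unknown operation: {op}")


def reachable_states(initial, max_ops=5):
    """Enumerate the valuations reachable by any operation sequence of length <= max_ops."""
    states = {initial}
    for length in range(1, max_ops + 1):
        for ops in product((ALLOC, FREE), repeat=length):
            state = initial
            for op in ops:
                state = apply_op(state, op)
            states.add(state)
    return states


initial = (frozenset({"u", "v"}), "r")
states = reachable_states(initial)
print(len(states))
```

Whatever the length of the operation sequences, the enumeration never yields more than the four valuations listed in the proof.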

It is well known that the number of timed zones computable from an
$\varepsilon$-closure is finite when the clocks have an upper bound [4]. In the case
of pattern matching this bound is the sum of all inputted delays. As the
number of memory states and the number of clock states are both finite, the number of
combinations between them is finite too.


4 Timed Pattern Matching with DBM 


In this section we detail the core of the timed aspect of the pattern matcher. 
As explained in the previous section, it is based on non-trivial computation 
of timezones. Our approach is based on the classical model described in [10], 
which represents timezones as difference bound matrices (DBM). Unlike [10] 
our objective is not to develop a model-checking procedure but a recognition 
algorithm, hence there are many differences in the details. 
4.1 The Clocks Representation


As explained in the previous section, a timezone corresponds to the set 
of all clock valuations satisfying both the possible intervals and the time 
constraints in a given state of the automaton. We represent a timezone as 
a difference bound matrix (DBM), owing to the fact that the timezones 
together with the time constraints expressed in the grammar of Definition 2 
can be represented precisely by bounds on individual clocks and on the 
differences between pairs of clocks. 

A DBM representing the timezone of the set of clocks $\{c_1, c_2, \ldots, c_n\}$ is
a matrix $D = \{d_{ij}\}_{0 \le i,j \le n}$ where each $d_{ij}$ is a time bound, i.e., an element
of the set $(\mathbb{Q} \times \{<, \le\}) \cup \{(0, \le), (\infty, <)\}$. Each time bound $d_{ij} = (x, \sim)$
expresses the constraint $c_i - c_j \sim x$. A DBM also requires a special clock $c_0$
with constant value 0 used to represent the minimal and maximal values of the
other clocks. The bounds are ordered such that $d_1 < d_2$ with $d_1 = (x_1, \sim_1)$
and $d_2 = (x_2, \sim_2)$ iff $x_1 < x_2 \lor (x_1 = x_2 \land \sim_1 = {<} \land \sim_2 = {\le})$. We will also
often use the minimum of two bounds, denoted by $\min(d_1, d_2)$. The ordering
relation on bounds naturally extends to an inclusion ordering on DBMs.

We denote by $v \in D$ the fact that a valuation $v = (v_1, v_2, \ldots, v_n) \in \mathbb{Q}^n$
(where each $v_i$ is a value for clock $c_i$) is within the timezone represented by
the matrix $D$. Let $D = \{(x_{ij}, \sim_{ij})\}_{0 \le i,j \le n}$; then $v \in D$ iff $v_i - v_j \sim_{ij} x_{ij}$,
$\forall i,j \in [0,n]$ (assuming $v_0 = 0$).
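
As an illustration of the bound ordering and of the membership test, here is a minimal sketch. The encoding is an assumption of this sketch and not necessarily the one used in the tool: a bound is a Python tuple `(value, tag)` with tag `LT = 0` for $<$ and `LE = 1` for $\le$, so that the built-in tuple comparison coincides with the ordering on bounds given above.

```python
import math

LT, LE = 0, 1            # '<' sorts before '<=' when the values are equal
INF = (math.inf, LT)     # the unconstrained bound (infinity, <)


def satisfies(diff, bound):
    """Check diff ~ x for a bound (x, ~)."""
    value, rel = bound
    return diff < value or (rel == LE and diff == value)


def contains(dbm, valuation):
    """Membership test v in D, with the reference clock c_0 fixed to 0."""
    v = (0,) + tuple(valuation)
    n = len(v)
    return all(satisfies(v[i] - v[j], dbm[i][j])
               for i in range(n) for j in range(n))


# One clock c_1 constrained to the interval [2, 4]:
# d_01 encodes c_0 - c_1 <= -2 and d_10 encodes c_1 - c_0 <= 4.
D = [[(0, LE), (-2, LE)],
     [(4, LE), (0, LE)]]

print(contains(D, (3,)), contains(D, (5,)))
print(min((3, LE), (3, LT)))
```

With this encoding, `min` on tuples directly gives the minimum of two bounds.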

Three operations on DBMs defined in [10] are required by the pattern
matcher. We denote by $empty(D)$ the emptiness predicate, i.e., the Boolean
function returning True if and only if no valuation exists in the timezone
represented by $D$. The intersection of DBMs $D_1$ and $D_2$, i.e., the DBM
representing the intersection of the corresponding timezones, is denoted by
$D_1 \cap D_2$. The canonical form $[D]$ of a DBM $D$ is the strongest set of bounds
defining the same timezone as $D$.
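
These three operations are classically implemented with a shortest-path closure. The sketch below is a textbook version and not the code of [10] or of the tool; it assumes the same hypothetical encoding of bounds as tuples `(value, tag)` with `LT = 0` and `LE = 1`.

```python
import math

LT, LE = 0, 1
INF = (math.inf, LT)


def add(b1, b2):
    """Sum of two bounds: the result is non-strict only if both bounds are."""
    return (b1[0] + b2[0], min(b1[1], b2[1]))


def canonical(dbm):
    """Canonical form [D]: tighten every bound with a Floyd-Warshall closure."""
    n = len(dbm)
    d = [row[:] for row in dbm]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], add(d[i][k], d[k][j]))
    return d


def empty(dbm):
    """empty(D): the zone has no valuation iff the closure has a negative cycle."""
    d = canonical(dbm)
    return any(d[i][i] < (0, LE) for i in range(len(d)))


def intersection(d1, d2):
    """D1 intersected with D2: entry-wise minimum of the bounds."""
    n = len(d1)
    return [[min(d1[i][j], d2[i][j]) for j in range(n)] for i in range(n)]


# Two clocks with 4 <= c_1 <= 14 and 4 <= c_2 <= 6, no explicit two-clock bound.
D = [[(0, LE), (-4, LE), (-4, LE)],
     [(14, LE), (0, LE), INF],
     [(6, LE), INF, (0, LE)]]
print(canonical(D)[1][2])   # tightest bound on c_1 - c_2

A = [[(0, LE), (-2, LE)], [(4, LE), (0, LE)]]   # c_1 in [2, 4]
B = [[(0, LE), (-5, LE)], [(7, LE), (0, LE)]]   # c_1 in [5, 7]
print(empty(intersection(A, B)))
```

The closure derives the two-clock bound $c_1 - c_2 \le 10$ from the individual bounds, and the two disjoint intervals have an empty intersection.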

The precise definitions of the operators and notations discussed above
can be found in [10]. For our approach, we also need a few specific operators.
First, we define the extension of the maximal bounds of a DBM $D^1 = \{d^1_{ij}\}_{0 \le i,j \le n}$
to those of another DBM $D^2 = \{d^2_{ij}\}_{0 \le i,j \le n}$. Formally we have:

$$ext(D^1, D^2) = \left\{ d_{ij} = \begin{cases} d^2_{i0} & \text{if } j = 0 \\ d^1_{0j} & \text{if } i = 0 \\ \min(d^1_{ij}, d^2_{ij}) & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n}$$

One way of interpreting the definition is that $ext(D^1, D^2)$ constructs a
path of valuations from $D^1$ towards $D^2$.

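A second operator is the reset, which sets a clock value to zero and
updates the DBM so that the constraints are consistent w.r.t. the new value.
Formally, we write, for a DBM $D = \{d_{ij}\}_{0 \le i,j \le n}$ and a clock $c_k$:

$$D[c_k \leftarrow 0] = \left\{ d'_{ij} = \begin{cases} (0, \le) & \text{if } (i = k \land j = 0) \lor (i = 0 \land j = k) \\ d_{0j} & \text{if } i = k \land j \ne 0 \\ d_{i0} & \text{if } j = k \land i \ne 0 \\ d_{ij} & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n}$$

To reset a set of clocks $\{c_1, c_2, \ldots\}$ we can write $D[c_1, c_2, \ldots \leftarrow 0]$, which is
equivalent to $D[c_1 \leftarrow 0][c_2 \leftarrow 0] \ldots$

The shift operator translates a timezone according to a time delay $x$:

$$shift(D, x) = \left\{ d'_{ij} = \begin{cases} d_{i0} + x & \text{if } i > 0,\ j = 0 \\ d_{0j} - x & \text{if } i = 0,\ j > 0 \\ d_{ij} & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n}$$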


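Finally, the release operator is used to remove the constraints concerning
a clock. For a DBM $D = \{d_{ij}\}_{0 \le i,j \le n}$ and a clock $c_k$:

$$D \setminus c_k = \left\{ d'_{ij} = \begin{cases} (\infty, <) & \text{if } i = k \\ (0, \le) & \text{if } j = k \land i = 0 \\ d_{i0} & \text{if } j = k \land i \ne 0 \\ d_{ij} & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n}$$

As for the reset operator, we let $D \setminus \{c_1, c_2, \ldots\}$ be equivalent to $D \setminus c_1 \setminus c_2 \ldots$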
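
The reset, shift and release operators can be sketched as follows. This is an illustrative reading of the case definitions above, not the tool's code: it assumes bounds encoded as tuples `(value, tag)` with `LT = 0` and `LE = 1`, and it deliberately leaves the diagonal entry $d_{kk}$ untouched, which is an assumption of the sketch.

```python
import math

LT, LE = 0, 1
ZERO, INF = (0, LE), (math.inf, LT)


def reset(dbm, k):
    """D[c_k <- 0]: c_k takes the value of the reference clock c_0."""
    out = [row[:] for row in dbm]
    for idx in range(len(dbm)):
        if idx == k:
            continue                      # diagonal entry kept (assumption)
        out[k][idx] = ZERO if idx == 0 else dbm[0][idx]
        out[idx][k] = ZERO if idx == 0 else dbm[idx][0]
    return out


def shift(dbm, x):
    """shift(D, x): translate the upper and lower bounds of every clock by x."""
    out = [row[:] for row in dbm]
    for i in range(1, len(dbm)):
        upper, rel_upper = dbm[i][0]
        lower, rel_lower = dbm[0][i]
        out[i][0] = (upper + x, rel_upper)
        out[0][i] = (lower - x, rel_lower)
    return out


def release(dbm, k):
    """D \\ c_k: drop every constraint bounding c_k from above."""
    out = [row[:] for row in dbm]
    for idx in range(len(dbm)):
        if idx == k:
            continue                      # diagonal entry kept (assumption)
        out[k][idx] = INF
        out[idx][k] = ZERO if idx == 0 else dbm[idx][0]
    return out


# Hypothetical zone: 2 <= c_1 <= 14, 1 <= c_2 <= 2, 0 <= c_1 - c_2 <= 13.
D = [[ZERO, (-2, LE), (-1, LE)],
     [(14, LE), ZERO, (13, LE)],
     [(2, LE), (0, LE), ZERO]]

print(shift(D, 9)[0][1], shift(D, 9)[1][0])
print(reset(D, 2)[2])
print(release(D, 1)[1])
```

Resetting or releasing a set of clocks then amounts to chaining the calls, as in the equivalences stated above.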


4.2 Handling of $\varepsilon$-Paths


The function $\delta_{closure}$ of Definition 11 in Section 3 computes a next token
while traversing a path of $\varepsilon$-transitions given a time delay $x$ as input (with
$x > 0$). We now explain, in terms of DBMs, the details of this computation.


Definition 12 (Walking an $\varepsilon$-path)
$$walk((D, M^0), [t_1, t_2, \ldots, t_n], x) = ((F^n \cap dbm(\Gamma_{q^n})), M^n)$$

where $q^n$ is the arrival location of $t_n$, and $F^n$, $M^n$ are obtained
thanks to the following iterative procedure:

$$(P^0, W^0, F^0) = (D, D, shift(D, x))$$

$$(P^k, W^k, F^k, M^k) = \delta_{closure}(t_k, (P^{k-1}, W^{k-1}, F^{k-1}, M^{k-1})), \quad k \in [1, n]$$
The function walk is based on an iteration procedure, which consists in
updating a set of three distinct DBMs: $P$ (past), $W$ (now) and $F$ (future).
When a time delay $x$ is inputted, $P$ represents the initial clock valuations,
and $F$ represents the expected clock valuations after the delay $x$, independently
of the location of the token. The third timezone $W$ is used to represent
the successive clock valuations between possibly several $\varepsilon$-transitions.

The core of the walk function is the function $\delta_{closure}$ that takes as
parameters a transition, the three DBMs and a memory valuation. It
produces the updated DBMs and memory valuation.
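
The iteration performed by walk is essentially a fold over the transitions of the path. The following sketch only shows that control structure: the update function, the shift, the intersection and the invariant lookup are passed as parameters, and the `toy_*` helpers are hypothetical stand-ins in which a "zone" is a plain interval on a single clock rather than a DBM.

```python
def walk(dbm, memory, transitions, delay, delta_closure, shift, intersect, invariant_of):
    """Fold delta_closure over an epsilon-path.

    Returns (F^n intersected with the invariant of the arrival location, M^n),
    or None as soon as one transition of the path is not enabled.
    """
    past, now, future = dbm, dbm, shift(dbm, delay)      # (P^0, W^0, F^0)
    for transition in transitions:
        step = delta_closure(transition, past, now, future, memory)
        if step is None:                                  # update undefined
            return None
        past, now, future, memory = step                  # (P^k, W^k, F^k, M^k)
    return intersect(future, invariant_of(transitions[-1])), memory


# Toy stand-ins (hypothetical): a zone is a closed interval (low, high).
def toy_shift(zone, delay):
    return (zone[0] + delay, zone[1] + delay)


def toy_intersect(a, b):
    return (max(a[0], b[0]), min(a[1], b[1]))


def toy_delta_closure(transition, past, now, future, memory):
    # Every transition is enabled; the memory simply records the path taken.
    return past, now, future, memory + [transition]


result = walk((0, 1), [], ["t1", "t2"], 4, toy_delta_closure,
              toy_shift, toy_intersect, lambda transition: (0, 4))
print(result)
```

In the real computation the three zones are DBMs and the update is the dataflow described in the rest of this section.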


Definition 13 (Update) Consider a transition $t$ from location $q$ to location $q'$
(with time constraint $\gamma$ and clock resets $\rho$), the three
DBMs $P$, $W$ and $F$, and $M$ a memory valuation, then:
$$\delta_{closure}(t, (P, W, F, M)) = (\delta_{delay}(t, P, W, F), \delta_{mem}(t, M))$$


The memory update $\delta_{mem}$ is defined in Section 3 (cf. Definition 10). In
the following, we focus on the DBM computations performed by $\delta_{delay}$.


4.3 The Time Delay Dataflow


Definition 14 (Time update) Let $t$ be a transition from $q$ to $q'$ (with time
constraint $\gamma$ and clock resets $\rho$), and the
DBMs $P$, $W$, $F$, then the function $\delta_{delay}(t, P, W, F)$ is computed according to
the dataflow of Figure 6. This function is defined only if the whole dataflow
procedure is executed.


Figure 6: The dataflow for computing $\delta_{delay}$.


The partial function $\delta_{delay}$ computes, from the three inputted DBMs, the
clock valuation enabling the transition and the outgoing time valuation after
the possible resets. It is structured according to the dataflow of Figure 6.
From the source input on the left to the output on the right, the three DBMs
$P$, $W$, $F$ flow through a certain number of nodes. There is also a fourth
DBM $G$ that interprets the time constraints of the transition. This DBM is
only used internally by the dataflow. Each node in the dataflow corresponds
to a function that may sometimes fail to compute a value, which means
that $\delta_{delay}$ is in fact not defined for the given input. Put in other terms,
the considered transition is not enabled. There are three main phases in
the computation. First, the preparation step (nodes with prefix prep) takes
into account the coarsest constraints to reduce the input DBMs. Then, the
transition step (prefix trans) takes into account the time constraints of the
transition. Finally, the reset step (prefix reset) computes the output values
by applying the clock resets of the transition. As shown by the diagram
arrows, there are quite intricate flow dependencies between the nodes. In
the remainder of the section, we present each node function in detail.
But first we introduce our running example for illustrating the definitions.


Figure 7: A simple timed automaton with two clocks and two $\varepsilon$-transitions.


Example 4 We follow the crossing of the first transition on the automaton
of Figure 7, without going into the details of the dataflow computation. The
automaton has two clocks $c_1$ and $c_2$, and the initial timezone is represented
by the following DBM:


(0, <) (-2, <) (— Es S 
DS) Uys) (0,5). (1635) 
(2,<) (5) (,s) 


We assume that the inputted time delay is 9 seconds, hence the iterations of
walk begin with the following DBMs:

$$P^{(0)} = D; \quad W^{(0)} = D; \quad F^{(0)} = shift(D, 9)$$

Now, we let $(P^{(1)}, W^{(1)}, F^{(1)}) = \delta_{delay}(t_1, P^{(0)}, W^{(0)}, F^{(0)})$ for
the first transition $t_1$ from $q_0$ to $q_1$. The updated timezones are the
following: $P^{(1)} = P^{(0)}$ and $F^{(1)} = F^{(0)}$ and


$$W^{(1)} = \begin{pmatrix} (0, \le) & (-4, \le) & (-4, \le) \\ (22, \le) & (0, \le) & (16, \le) \\ (6, \le) & (0, \le) & (0, \le) \end{pmatrix}$$


The further examples will detail the crossing of the second transition from
this first application of $\delta_{delay}$.


Preparation phase The node $prep_G$ of the dataflow computes the DBM $G$
corresponding to the global time constraint. It is defined as follows:

$$prep_G(q \to q') = [\, tz(\gamma) \cap tz(\Gamma_q) \cap (tz(\Gamma_{q'}) \setminus \rho) \,]$$

The DBM $G$ is the timezone corresponding to all the constraints the clocks
have to satisfy in order to enable the transition from $q$ to $q'$. The constraints
from the timed constraint of $q'$ concerning the reset clocks are released
because the values enabling the transition are not the ones entering $q'$ for
these clocks.

In the preparation phase, we also need to filter out the valuations that
would contradict the global constraint. We remove from $P$ and $W$ the
valuations with at least one clock value above the maximal value of the same
clock in $G$. Symmetrically, the invalid valuations of $F$ are those with at
least one clock below the minimal value in $G$. More formally, we have:


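$$prep_W(W, G) = \left[ \left\{ w'_{ij} = \begin{cases} \min(w_{i0}, g_{i0}) & \text{if } j = 0 \\ w_{ij} & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n} \right]$$

$$prep_F(F, G, P) = \left\{ f'_{ij} = \begin{cases} \min(f_{ij}, g_{ij}) & \text{if } j \ne 0 \\ f_{i0} - p_{i0} + g_{i0} & \text{if } j = 0,\ g_{i0} < p_{i0} \\ f_{ij} & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n}$$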
Note that the timezones $P$ and $F$ are linked as they represent respectively
the initial and final valuations of the clocks. If one is changed then the other
must also be updated. However, we do not need to add a preparation node
for $P$ because it is not needed for the next phase, and the computation
would be redundant. In the example below we nevertheless show how $P$
and $F$ are synchronized, for illustration purposes.

All the DBMs computed in this phase and most of those computed in
the following phases are canonicalized (using operator $[\cdot]$). This ensures
that no precision loss may be propagated during the computation. Only $P$
is not canonicalized as it is synchronized on $F$, which is already in canonical
form. Furthermore, the only values used in $P$ are the individual clocks'
intervals ($p_{i0}$ and $p_{0j}$ for all $i$, $j$).


Figure 8: Illustrating the preparation phase of $\delta_{delay}$ applied on $t_2$ from the
automaton of Figure 7.


Example 5 Figure 8 illustrates the preparation phase of
$\delta_{delay}(t_2, P^{(1)}, W^{(1)}, F^{(1)})$. Since the automaton only uses two clocks,
the DBMs can be represented as polyhedra in a 2-dimensional space. The
global constraint $G = prep_G(t_2)$ is depicted as a rectangle in the middle
of the figure. The initial DBMs $P^{(1)}$, $W^{(1)}$, $F^{(1)}$ are depicted as polyhedra
outlines. The filled zones correspond to results of the preparation functions.

First, we define $W' = prep_W(W^{(1)}, G)$, which removes the right part
of $W^{(1)}$ (shown as a barred area). Indeed, each valuation of $W^{(1)}$ where
$c_1 > 14$ must be filtered out as 14 is the upper bound of $c_1$ in $G$. In the case
of the "future" DBM we define $F' = prep_F(F^{(1)}, G, P^{(1)})$. The valuations
of $F^{(1)}$ where $c_1 < 11$ must be removed because the lowest value of $c_1$ in $G$
is 11. This corresponds to the barred area on the left of $F'$ on the figure. If
we actually computed the current update for $P$ (the filled zone at the bottom
of the picture), then we would have to remove the corresponding barred area.
Symmetrically, the barred area on the right of the updated $P$ is also removed
on the $F'$ side⁶.

The canonical form of both $W'$ and $F'$ is computed to get the most precise
values for the two-clock constraints, i.e., the constraints corresponding to
pairs of clocks (the diagonals in Figure 8). For $F'$ we get $-1 \le c_1 - c_2 \le 13$.
And for $W'$ we get $0 \le c_1 - c_2 \le 10$. At the end of the first step we have:


⁶ As already explained we do not have to actually compute the updated timezone of $P$.


Transition phase In the next step, we actually "enter" the transition
by first updating the timezones so that it is enabled. Moreover, the values
unreachable after the transition are filtered out. We first consider the update
of $W$, as follows:

$$trans_W(W, F, G) = [(ext(W, F) \cap G)]$$

This function computes the reachable clock valuations by extending
the elements of $W$ toward their final position in $F$. In the extension,
only the elements satisfying the global time constraint $G$ are preserved.
If the result of $trans_W$ is an empty DBM, this means a time constraint is
not satisfied, thus the transition is not enabled (and the whole dataflow
execution fails).


$$trans_F(F, W') = \left[ \left\{ f'_{ij} = \begin{cases} \min(w'_{ij}, f_{ij}) & \text{if } i, j \ne 0 \\ f_{ij} & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n} \right]$$

$$trans_P(P, F, F') = \left\{ p'_{ij} = \begin{cases} p_{ij} - f_{ij} + f'_{ij} & \text{if } (i = 0 \lor j = 0) \land p_{i0} \ne (0, \le) \\ p_{ij} & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n}$$

The two-clock time constraints of the DBM returned by $trans_W$ are
applied on $F$ to remove its unreachable values. The result is put in canonical
form because the actual maximal and minimal values of each clock must be
known before the next phase. The function $trans_P$ is used to synchronize $P$
with the DBM returned by $trans_F$. As depicted on the dataflow (cf. Figure 6), the
parameter $F$ is the initial DBM whereas $F'$ is the output of $trans_F$. The
function removes from $P$ the area of $F$ absent in $F'$, which we illustrate in
the next step of the example.


Figure 9: Illustrating the transition phase of $\delta_{delay}$.


Example 6 Figure 9 represents the transition phase of $\delta_{delay}$. At the
beginning, the DBMs $W'$ and $F'$ resulting from the preparation phase,
together with $P^{(1)}$, are represented by their outline. The DBM $W'' =
trans_W(W', F', G)$ is represented as a filled area, which corresponds to the
extension of $W'$ towards $F'$ (the dotted area on the figure), intersected with
the global constraint $G$.


ext(W’, F’) Gr Ww’ 
(0, <), (-4, S), (-4, S$) (0; <), (1255); (5yS) 
(23, <), (0, <), (10;:<) 9 Ge (14, <), (0, <), (9, <) 
(1s) (15) (0, 3) (8, <), (—4, S), (0, <) 


Once $W''$ is computed, we can define $F'' = trans_F(F', W'')$, which
consists in removing the valuations of $F'$ that are not reachable from $W''$.
Finally, the computation of $P' = trans_P(P^{(1)}, F^{(1)}, F'')$ consists in comparing
the definition interval of each clock in $F^{(1)}$ and $F''$, and subtracting the
difference from the bounds of $P^{(1)}$:


(0, <), (14, S), (-10, <) (O35 )(—5,S)(=18) 
EB’ = (20,<); (04), (9e=) PPS |) Ca (0; = 16s) 
(11, <), (-4, <), (0, <) (2, <), (0, SOS) 
Reset phase The final step consists in resetting the clocks present in $\rho$.
The three nodes of this part of the dataflow produce the outputs of $\delta_{delay}$.
For the timezone $W$ the computation is straightforward:

$$reset_W(W, \rho, q') = [W[\rho \leftarrow 0] \cap tz(\Gamma_{q'})]$$

The resulting timezone corresponds to the clock valuations entering the
arrival location $q'$. These are the valuations of $W$ where the clocks of $\rho$ are
reset, and such that the invariant $\Gamma_{q'}$ is satisfied.


The computation performed by the node $reset_F$ is a little bit more
involved. The definition is as follows:

$$reset_F(F, \rho, W, W', P) = \left\{ f'_{ij} = \begin{cases} \min(\{f_{k0} + w_{0k} \mid 1 \le k \le n\} \cup \{f_{k0} - p_{k0} \mid 1 \le k \le n\}) & \text{if } c_i \in \rho,\ c_j = c_0 \\ \min(\{(0, \le)\} \cup \{f_{0k} + w_{k0} \mid 1 \le k \le n\}) & \text{if } c_i = c_0,\ c_j \in \rho \\ w'_{ij} & \text{if } i, j \ne 0 \\ f_{ij} & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n}$$

The function needs both the DBM $W$ outputted by $trans_W$ and $W'$ as
returned by $reset_W$. It returns the timezone representing all the valuations
reachable in the destination location $q'$ independently of the timed constraint
of $q'$. To compute it from $F$ (result of $trans_F$) we have to find the intervals
of definition for all the reset clocks, and then restrict the timezone with
the time constraints of $W'$. All the reset clocks have the same maximal
and minimal valuations: $\forall c_i, c_j \in \rho,\ f_{i0} = f_{j0} \land f_{0i} = f_{0j}$. This value is the
maximum (resp. minimum) distance between the values in $W$ and their
corresponding final position in $F$. We have to make sure that this distance
is not greater than the maximal distance between a point of $P$ (from $trans_P$)
and the corresponding point in $F$. The non-reset clocks keep their previous
maximal and minimal values. In case the resulting DBM is empty, the
dataflow execution is considered failed.


Finally, $reset_P$ will generate the timezone corresponding to $P$ after the
resets.

$$reset_P(P, \rho, F, F') = \left\{ p'_{ij} = \begin{cases} (0, \le) & \text{if } (i = 0,\ c_j \in \rho) \lor (c_i \in \rho,\ j = 0) \\ p_{ij} - f_{ij} + f'_{ij} & \text{if } (i = 0 \lor j = 0),\ p_{i0} \ne (0, \le) \\ p_{ij} & \text{otherwise} \end{cases} \right\}_{0 \le i,j \le n}$$

The DBM $F$ is the result of $trans_F$ and the DBM $F'$ is the result of $reset_F$.
All the clocks of $\rho$ have their maximal and minimal values set to zero. A
clock of $P$ of value zero is only used by $reset_F$ to get the upper bound on the
maximal value of the reset clocks. For the non-reset clocks, their maximal
and minimal values are restricted in order to keep the corresponding bound
with the ones returned by $reset_F$, as required by the preparation phase of
the (potential) next transition. This is done by comparing them with the
bound of $F$ (from $trans_F$).


Figure 10: Illustrating the reset phase of $\delta_{delay}$.


Example 7 Figure 10 depicts the final reset phase, the computation of the
zones outputted by $\delta_{delay}$. The zones resulting from the transition phase are
depicted by their respective outline. The timezones $W^{(2)} = reset_W(W'', \rho, q')$
and $P^{(2)} = reset_P(P, \rho, F, F')$ are flattened on the $c_2$ axis because the clock $c_1$
is reset. The timezone $F^{(2)} = reset_F(F'', \rho, W'', W^{(2)}, P')$ is depicted by the
filled area on the picture. We first compute the maximal and minimal allowed
values for each clock. The non-reset clocks keep the same bounds as the
ones from $F''$. For the reset clocks, we have to compute the remaining time
interval represented on the picture with the arrows $[r, R]$. This corresponds to
the minimal, resp. maximal, distance separating a valuation of $W''$ from its
corresponding "future" in $F''$. However, the maximal value cannot be greater
than the distance separating an element of $P'$ from its corresponding element
in $F''$, to avoid a situation in which the remaining time is greater than the
initial delay. The two-clock relations are the ones from $W^{(2)}$. Finally, its
canonical form is computed to have the exact bound of values for each clock.
(0, <), (0, StS) (0, <), (0, S).(—9; >) 
p(?) = (0, =); (0,5); (16, <) ;w?) = (0, =); (0, =), (=o; 5) 
(2, <), (0, <), (0, <) (8, <), (8, S), (0, S) 

(0,5), (=2,5),(=10,5) 

F?) = | (6,<),(0,<),(-5,S) 


5 Pattern Language and Experiments 


5.1 Pattern Language 


The description of non-trivial patterns in link streams can become tedious
if specified directly as automata. Indeed, even simple patterns can yield
very large automata. We are looking for a more concise way to describe the
patterns, in the spirit of regular expressions. We propose the language of
timed $\nu$-expressions to specify patterns for link streams.


Node         $n, n_1, n_2, \ldots$ ::=  $k$                          (known node)
                                     |  $X$                          (variable, unknown node)
                                     |  @                            (arbitrary node)
Expression   $e, e_1, e_2, \ldots$ ::=  $n$                          (node)
                                     |  $n_1 \to n_2$                (link)
  (regular)                          |  $e_1 \cdot e_2$              (concatenation)
                                     |  $e_1 \mid e_2$               (disjunction)
                                     |  $e_1 \otimes e_2$            (shuffle)
                                     |  $e^*$                        (iteration)
  (time)                             |  $\langle e \rangle_{[x,y]}$  (delay⁷)
  (memory)                           |  $\sharp\{X_1, \ldots, X_n\}\, e$   (allocation)
                                     |  $e\, \{X_1, \ldots, X_n\}!$  (release)


Table 1: The (core) pattern language


The syntax of the core constructs is given in Table 1. The basic constructs
are those of traditional regular expressions. The symbols refer
to known, unknown or arbitrary nodes. The link construct $n_1 \to n_2$ describes
a non-breaking connection between two nodes. The delay construct for time
constraints is the same as in [3]. The constructs for memory management
are based on variable occurrences (for unknown nodes), allocations and
releases. The notation $\sharp\{X_1, \ldots, X_n\}\, e$ (resp. $e\, \{X_1, \ldots, X_n\}!$) means that
the variables $X_1, \ldots, X_n$ are allocated (resp. released) before (resp. after)
recognizing the subexpression $e$. The shuffle operator $\otimes$ is present in the
language to ease the description of patterns with independent parts.


⁷ Following [3], the expression inside a delay should not be empty.


The semantics of the pattern language is given in terms of a generated
timed $\nu$-automaton. A special case is the link expression $n_1 \to n_2$, which
corresponds to a basic automaton with three locations and two transitions
in a row, one for $n_1$ and the second for $n_2$. One important property is
that this construction is non-breaking (e.g. it is atomic for the shuffle).
Note that the translation is relatively straightforward. The translation rules
for the regular expression constructs are the classical ones. The function
$aut : expression \to automaton$ translates a timed $\nu$-expression to the
corresponding timed $\nu$-automaton.


Figure 11 illustrates the translation for some notable operators of the
language from [3]: delay and concatenation, which are impacted by the time
component.


To translate the delay operator $\langle e \rangle_I$, we first need to generate the
automaton of the constrained sub-expression $e$. Then we create a new
clock $c$ dedicated to measuring the time for the new constraint. Finally, all
transitions to a final location of the automaton have their timed constraints
strengthened with the constraint $c \in I$.


The timed aspect of the concatenation operator $e_1 \cdot e_2$ consists in resetting
all the clocks in order to initialize the checking of the timed constraints.
Its translation is mostly the same as for regular expressions: each of the
sub-expressions, $e_1$ and $e_2$, is translated to an automaton (resp. $A_1$ and $A_2$)
and a new automaton $A$ is created containing all the locations of $A_1$ and $A_2$,
all their transitions and clocks. The initial location of $A$ is the initial location
of $A_1$ and its final locations are those of $A_2$. Moreover, for each transition
of $A_1$ going to one of its final locations, the equivalent transition but with
the initial location of $A_2$ as destination and resetting all the clocks of $C$ is
added to $A$.
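
The two translation rules above can be sketched on a deliberately simplified automaton structure. Everything in this sketch is hypothetical and chosen for illustration (the `Transition` and `Automaton` classes, constraints encoded as `(clock, low, high)` triples, memory operations omitted); it also assumes that $C$ denotes the clock set of the resulting automaton and that the two automata have disjoint location names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transition:
    src: str
    label: str                     # node symbol, or "eps" for an epsilon-transition
    dst: str
    constraints: tuple = ()        # triples (clock, low, high): low <= clock <= high
    resets: frozenset = frozenset()


@dataclass
class Automaton:
    locations: set
    initial: str
    finals: set
    transitions: list
    clocks: set = field(default_factory=set)


def delay(a, clock, low, high):
    """Delay <e>_[low, high]: constrain every transition entering a final location."""
    assert clock not in a.clocks, "the clock must be fresh"
    transitions = [
        Transition(t.src, t.label, t.dst,
                   t.constraints + ((clock, low, high),), t.resets)
        if t.dst in a.finals else t
        for t in a.transitions
    ]
    return Automaton(set(a.locations), a.initial, set(a.finals),
                     transitions, a.clocks | {clock})


def concat(a1, a2):
    """Concatenation e1 . e2: bridge the final transitions of a1 to the start of a2."""
    clocks = a1.clocks | a2.clocks
    transitions = list(a1.transitions) + list(a2.transitions)
    for t in a1.transitions:
        if t.dst in a1.finals:
            # Same transition, redirected to the initial location of a2,
            # resetting all the clocks (assumed reading of "the clocks of C").
            transitions.append(Transition(t.src, t.label, a2.initial,
                                          t.constraints, frozenset(clocks)))
    return Automaton(a1.locations | a2.locations, a1.initial,
                     set(a2.finals), transitions, clocks)


a1 = Automaton({"p0", "p1"}, "p0", {"p1"}, [Transition("p0", "u", "p1")])
a2 = Automaton({"r0", "r1"}, "r0", {"r1"}, [Transition("r0", "v", "r1")])
timed = delay(a1, "c", 0, 5)
both = concat(timed, a2)
print(len(both.transitions), both.initial, both.finals)
```

The original transitions of $A_1$ are kept, as in the rule above; only the added bridging transitions reach $A_2$.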


Figure 11: Automata for the delay and concatenation operations.


Figure 12 illustrates how the allocation and release operators are translated.
The translation of $\sharp\{X_1, \ldots, X_n\}\, e$ gives rise to a new initial location $q_i$
and an $\varepsilon$-transition between $q_i$ and the initial location of the automaton
generated from $e$, which allocates the variables $X_1, \ldots, X_n$. The new initial
location of the automaton is $q_i$. The translation of $e\, \{X_1, \ldots, X_n\}!$ leads to
the creation of a new final location $q_f$ and a new transition from each final
location of the automaton generated for $e$ to $q_f$, each of them releasing the
variables $X_1, \ldots, X_n$. The new unique final location is $q_f$.

In the experiments we often used the following derived constructs:

• allocation and use: $\sharp X \stackrel{def}{=} \sharp\{X\}\, X$

• use and release: $X! \stackrel{def}{=} X\, \{X\}!$

• allocation, use and release: $\sharp X! \stackrel{def}{=} \sharp\{X\}\, X\, \{X\}!$
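
To make the structure of the language concrete, the core constructs and the derived ones can be represented as a small abstract syntax tree. This is only a sketch: the class names, the helper functions and the sample pattern are hypothetical, disjunction, shuffle and iteration are left out for brevity, and the nesting chosen for the combined allocation, use and release construct is one possible reading of its definition.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Known:              # k: known node
    name: str


@dataclass(frozen=True)
class Var:                # X: variable (unknown node)
    name: str


@dataclass(frozen=True)
class Link:               # n1 -> n2
    src: object
    dst: object


@dataclass(frozen=True)
class Concat:             # e1 . e2
    left: object
    right: object


@dataclass(frozen=True)
class Delay:              # <e>_[low, high]
    body: object
    low: float
    high: float


@dataclass(frozen=True)
class Alloc:              # variables allocated before recognizing body
    variables: Tuple[str, ...]
    body: object


@dataclass(frozen=True)
class Release:            # variables released after recognizing body
    body: object
    variables: Tuple[str, ...]


def alloc_use(x):
    """Derived construct: allocation and use."""
    return Alloc((x,), Var(x))


def use_release(x):
    """Derived construct: use and release."""
    return Release(Var(x), (x,))


def alloc_use_release(x):
    """Derived construct: allocation, use and release (assumed nesting)."""
    return Alloc((x,), Release(Var(x), (x,)))


# Hypothetical pattern: a link between two unknown nodes followed by the
# reverse link, all within 10 time units.
pattern = Delay(Concat(Link(alloc_use("X"), alloc_use("Y")),
                       Link(Var("Y"), use_release("X"))), 0, 10)
print(pattern)
```

A translation function in the style of $aut$ would then proceed by structural recursion on such a tree.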


5.2 Experiments 


Our main objective is to develop a practical pattern matching tool for link
stream analysis. An early implementation of the tool is available online⁸. In
this section we present early experiments applying this prototype to real-world
link streams.


⁸ The MaTiNa tool repository is at: https://github.com/clementber/MaTiNA


Figure 12: Automata for memory operators


To begin with, the worst-case complexity of our pattern matching algorithm
is exponential in the size of the link stream (the number of links). This
complexity is reached for instance in the case depicted in Figure 13, which is
a "memory-only" scenario. If the input is a sequence of distinct symbols then
the number of tokens associated to the unique location of the automaton
will double each time a symbol is consumed. For instance, in the Figure,
the 8 tokens are associated to distinct versions of the variable $U$ (the $U_i$'s)
after consuming the input a b c: one for each subset of the alphabet.
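
The doubling can be reproduced with a few lines of code. The sketch below does not encode the automaton of Figure 13 itself; it only assumes a single location whose loop may, non-deterministically, either store the consumed symbol in the variable $U$ or ignore it, so that a token is identified by the contents of its copy of $U$.

```python
def consume(tokens, symbol):
    """Each token either ignores the symbol or stores it in its copy of U."""
    return tokens | {token | {symbol} for token in tokens}


tokens = {frozenset()}          # a single token with an empty U
sizes = []
for symbol in "abc":
    tokens = consume(tokens, symbol)
    sizes.append(len(tokens))

print(sizes)
```

After three distinct symbols the location holds one token per subset of {a, b, c}, and each further distinct symbol doubles that number again.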


Figure 13: A subset automaton after input a b c.


However, timed constraints most often improve the situation by removing
expired tokens. Thus, in practice there are ways to avoid the worst-case
scenarios. This is similar to the practical "regex" tools, which in general go
well beyond regular expressions, also leading to exponential blowups in the
worst case [6, 7].

This makes experimental evaluation of our method particularly appeal- 
ing to estimate its practical performances and applicability. In order to do 
so, we consider two link streams built from two different real-world datasets: 
(1) a recording of traffic routed by a large internet trans-Pacific router [11], 
and (2) a one month capture of tweets on Twitter France. 

In the case of internet traffic, our motivation is to detect potential
coordinated attacks. To do so, we define a variant of the triangle pattern
discussed in section 2, namely 2x2 bicliques, i.e. squares, which [20] identified
as meaningful in this regard. Since there is approximately one link every 2s
in the stream and the stream lasts for a whole day, it must be clear that we
may not detect all untimed patterns in the stream. In this context, the time
frame of an attack is in general quite sudden and precise, and so time is a
crucial feature.


Figure 14: DDoS pattern recognition with time frames $\delta = 0.01$ (top) and
$\delta = 0.02$ (bottom).

Figure 15: Triangle detection in Twitter exchanges with running time and
number of detected instances (top) and local running time (bottom).

We present results for two different time frames in Figure 14. It displays
the total running time as a function of the number of processed links, together
with the number of found instances of the pattern. As expected, the number
of instances of the pattern increases with the time frame. Also, the tool
processes fewer links in a given amount of time (85 hours in this experiment).
Although our implementation is not optimized at all, the linear time cost of
the computations w.r.t. the number of processed links clearly appears.

Our second experiment targets communities of Twitter users. We
consider tweets over a period of a month, leading to a stream of 1.3 million
links⁹. The pattern we seek is an undirected complete graph between $k$ users
for a given $k$, i.e. cliques of size $k$ occurring in a time frame of ten minutes.
Figure 15 presents the results for $k = 3$, i.e. triangle detection. The running
time experiences sharp increases at specific times, which correspond to peak
periods in Twitter exchanges. This is confirmed by plotting the execution
time at each step of the computation (bottom part of the Figure). During such
peaks of tweets, the tool has to store more data than usual, leading to a
more costly processing of links. One way to mitigate this issue would be to
consider a variable time rate, e.g. by decomposing the link stream into distinct
sub-streams processed with different time frames.


6 Conclusion 


The language of timed $\nu$-expressions we propose to specify patterns in link
streams is heavily inspired by regular expressions, but enriched with timed
and memory features. The language is rather low-level but with well-chosen
derived constructs we think it is usable (and has been used) by domain
experts. The language has a straightforward translation to the core outcome
of our research: the timed $\nu$-automata formalism and the corresponding
recognition principles. Compared to the conference paper, this extended
version presents a generalized version of the time component of timed
$\nu$-automata. The clocks are now represented using timezones, modeled as
Difference Bound Matrices. Using them, we have formalized the timed
pattern matching dynamics. This extension allows in particular unrestricted
resets on $\varepsilon$-transitions and enhances the language expressiveness. However,
in order to fully exploit this semantic extension in practice, it would be
necessary to add some dedicated operators to the syntax of the language.


⁹ The data come from the Politoscope project by the CNRS Institut des Systèmes
Complexes Paris Ile-de-France (https://politoscope.org)

Beyond the formalities, we developed a functional, and freely available,
prototype that we experimented with in a realistic setting. Non-trivial patterns
have been detected on real-world link streams, with decent performance
for such an early prototype. These early experiments give us confidence
regarding the relevance of our approach.

For future work, we plan both theoretical investigations and more
practical work at the algorithmic and implementation level. We also expect
to broaden the application domains. In particular, since our detection is
performed online, one potential area of application is that of monitoring open
systems at runtime, e.g. for security or safety properties. At the theoretical
level, we plan to study the pattern language and its more precise relation to
the automata framework. Since the semantics are based on a token game,
the formalism is in a way closer to Petri nets than to classical
automata. Hence, interesting extensions of the formalism could be developed
based on a high-level Petri net formalism, e.g. in the spirit of [12]. Our
prototype tool uses a relatively naive interpreter for pattern matching. We
plan to improve its performance by first introducing a compilation step.
Moreover, there is an important potential for parallelization of the underlying
token game.